Frontiers in Genetics

Frontiers Media SA

Preprints posted in the last 30 days, ranked by how well they match Frontiers in Genetics's content profile, based on 197 papers previously published here. The average preprint has a 0.33% match score for this journal, so anything above that is already an above-average fit.

1
Genomic indicators of gene function: A systematic assessment of the human genome

Cooper, H. B.; Rojas Lopez, K. E.; Schiavinato, D.; Black, M. A.; Gardner, P. P.

2026-04-09 genomics 10.64898/2026.04.08.717348 medRxiv
Top 0.1%
18.2%

Proteins and non-coding RNAs are functional products of the genome that are central for crucial cellular processes. With recent technological advances, researchers can sequence genomes in the thousands and probe numerous genomic activities of many species and conditions. Such studies have identified thousands of potential proteins, RNAs and associated activities. However, there are conflicting interpretations of the results and therefore of which regions of the genome are "functional". Here we investigate the relative strengths of associations between coding and non-coding gene functionality and genomic features, by comparing reliably annotated functional genes to non-genic regions of the genome. We find that the strongest and most consistent associations between functional genes and genomic features are with transcriptional activity and evolutionary conservation. We also evaluated sequence-based statistics, genomic repeats, epigenetic and population variation data. Other features strongly associated with function include histone marks, chromatin accessibility, genomic copy-number, and sequence alignment statistics such as coding potential and covariation. We also identify potential issues with SNP annotations in short non-coding RNAs, as some highly conserved ncRNAs have significantly higher than expected SNP densities. Our results demonstrate the importance of evolutionary conservation and transcription activity for indicating protein-coding and non-coding gene function. Both should be taken into consideration when differentiating between functional sequences and biological or experimental noise.

2
A Bayesian multidimensional approach to decipher the genetic basis of dynamic phenotypes in multiple species

Blois, L.; Heuclin, B.; Bernard, A.; Denis, M.; Dirlewanger, E.; Foulongne-Oriol, M.; Marullo, P.; Peltier, E.; Quero-Garcia, J.; Marguerit, E.; Gion, J.-M.

2026-04-03 genetics 10.64898/2026.04.01.715770 medRxiv
Top 0.1%
15.0%

Deciphering the genetic architecture of complex quantitative phenotypes remains challenging in quantitative genetics. These traits not only depend on multiple genetic factors but are also established over time and across environments. Although quantitative genetics has investigated the genetic determinism of phenotypic plasticity in contrasted environmental conditions, time-related phenotypic plasticity has received less attention. Here we propose a multivariate Bayesian framework, the Bayesian Varying Coefficient Model (BVCM), designed for analysing the genetic architecture of time-related phenotypic plasticity by a multilocus approach. We applied the BVCM to time series phenotypes measured at various time scales (daily, monthly, yearly) across a diverse set of biological species. We included in this study: yeast (Saccharomyces cerevisiae), fungi (Fusarium graminearum), eucalyptus (Eucalyptus urophylla x E. grandis), and sweet cherry tree (Prunus avium). The BVCM results were compared with those obtained with a known genome-wide association method carried out time by time. For all species and traits, the BVCM was able to detect the major QTL identified by marker-trait association methods and revealed additional genetic regions of weak effect. It also increased the phenotypic variance explained for most of the phenotypes considered. It revealed dynamic QTLs with transitory, increasing or decreasing effects over time. By considering both the temporal and genetic multivariate structures in a single statistical model, we increased our understanding of the genetic architecture of complex traits, notably by reducing the issue of missing heritability. More broadly, this work lays the foundation for extended applications in functional genomics, evolutionary ecology, and crop breeding programs, in which time-related phenotypic plasticity remains crucial for predicting and selecting key quantitative complex traits.
Key message: By capturing the genetic factors influencing time-related phenotypic plasticity, our approach contributes to a deeper understanding of the dynamic nature of genotype-phenotype relationships.

3
Transposable elements as new players to decipher sex differences in Parkinson Disease

Gordillo-Gonzalez, F.; Galiana-Rosello, C.; Grillo-Risco, R.; Soler-Saez, I.; Hidalgo, M. R.; Siomi, H.; Kobayashi-Ishihara, M.; Garcia-Garcia, F.

2026-03-30 bioinformatics 10.64898/2026.03.27.714370 medRxiv
Top 0.1%
14.5%

We present a novel integrative analysis of transposable elements (TEs) in 4 single cell RNA-seq (scRNA-seq) datasets of postmortem substantia nigra pars compacta samples of Parkinson Disease (PD) patients and matched healthy controls, with the objective of building a trustworthy cell-type-specific atlas of TEs that may clarify the role of TEs in sex differences in PD. We used the soloTE tool to evaluate TE expression changes across all snRNA-seq studies identified in our previous systematic review, and then integrated the results using meta-analysis techniques. Finally, we evaluated the possible associations between TEs and protein-coding genes by integrating our previous results on this matter with the TE information obtained, in order to propose the possible mechanism of action by which some of the TEs contribute to PD.

4
Integrative Identification and Characterization of PCOS-Associated lncRNAs From the Interface of Genetic Association, Transcriptomics, and Gene Structure Evolution

He, Z.; Li, Y.; Shkurat, T. P.; Butenko, E. V.; Derevyanchuk, E. G.; Lomteva, S. V.; Chen, L.; Lipovich, L.

2026-04-02 genomics 10.64898/2026.03.31.715548 medRxiv
Top 0.2%
10.5%

Background: Polycystic ovary syndrome (PCOS) is a prevalent endocrine disorder and a leading cause of female infertility, with complex genetic, metabolic, and hormonal etiologies. Long non-coding RNAs (lncRNAs) have emerged as important regulators of diverse biological processes, yet their roles in PCOS remain underexplored. Here, we identified and characterized PCOS differentially expressed gene-associated lncRNAs (PDEGAL) with an integrative approach combining expression data, genetic association, and evolutionary analysis. Methods: Thirty-three PCOS-associated protein-coding genes were obtained from our prior study, and all their nearby and overlapping lncRNAs were annotated. These candidates were analyzed using UCSC Genome Browser-mapped annotations and datasets, including NCBI RefSeq, GENCODE, GTEx, GWAS SNPs, and conservation, as well as the FANTOM5 cap analysis of gene expression (CAGE) promoter data, to assess their expression, regulatory potential, genetic variant overlaps, and evolutionary conservation. Results: Twenty-three PDEGALs (18 antisense to, and 5 sharing bidirectional promoters with, known PCOS-associated protein-coding genes) were identified. 17 PDEGALs contained GWAS SNPs with statistically significant disease associations, 9 of which were associated with PCOS-related traits. 5 PDEGALs demonstrated expression in the KGN granulosa cell model of PCOS. Key gene structure element (KGSE) analysis revealed that most PDEGALs are primate-specific. Integrating four criteria (GTEx expression, GWAS SNPs, FANTOM promoterome, and KGSE conservation) highlighted HELLPAR as the only lncRNA fulfilling all four, while five others (PGR-AS1, MTOR-AS1, ENSG00000265179, ENSG00000256218, and LOC105377276) fulfilled three of the four criteria. Conclusions: We have systematically identified candidate PCOS regulatory lncRNAs with convergent genetic, expression, and evolutionary evidence. These results provide a framework for functional validation and highlight lncRNAs as potential biomarkers and therapeutic targets in PCOS that function by regulating their nearby and overlapping protein-coding genes.

5
Short Interrupted Repeats Cassette (SIRC) ensembles of plant genomes reflects evolutionary route

Gorbenko, I. V.; Scherbakov, D. Y.; Zverintseva, K. M.; Konstantinov, Y. M.

2026-03-30 plant biology 10.64898/2026.03.27.714674 medRxiv
Top 0.6%
7.2%

Short Interrupted Repeats Cassettes (SIRC) are recently discovered eukaryotic DNA elements that possess many traits of satellite DNA and mobile genetic elements, and consist of short direct repeats interspersed with diverse spacer sequences. The SIRC ensemble of an individual species is highly heterogeneous and cannot be studied using alignment methods. It was found that the number of similar SIRC sequences in a given pair of species is in general correlated with their taxonomic distance, and, at the same time, closely related species can possess very diverged SIRC ensembles, which makes the SIRC evolutionary pattern closer to that of mobile genetic elements. The SIRC sequences make up clusters with comparable sequence patterns that are likely to follow a doublet evolutionary model, which strongly suggests that the SIRC structure is supported by evolutionary selection. Several SIRC sequences of Arabidopsis were found to be of ancient origin, with an evolutionary history traceable as far back as the moss clade. We carried out unbiased detection of SIRC ensembles in 10 plant genomes and found that, despite very high intraspecies heterogeneity, SIRC sets possess a strong interspecies phylogenetic signal. Key message: Short Interrupted Repeats Cassettes are elements of ancient origin, and could potentially be used to trace organism history and to facilitate synteny and Hi-C analysis.

6
Exploring transcriptomic and genomic latent variable correction approaches in differential expression analysis.

Appulingam, Y.; Jammal, J.; Ali, A.; Topp, S.; NYGC ALS Consortium; Iacoangeli, A.; Pain, O.

2026-04-08 bioinformatics 10.64898/2026.04.07.716914 medRxiv
Top 0.6%
7.1%

Background: Differential expression analysis is a central tool for studying the biological processes altered in human diseases via transcriptomic signatures. However, transcriptomic datasets are systematically confounded by latent variables from two distinct sources: unmeasured technical and biological heterogeneity within the expression data, and expression differences driven by population stratification. Correction using expression-based surrogate variables (SVs) and genotype-based principal components (PCs) addresses these sources independently, yet no study has directly evaluated their combined use against either method alone within a differential expression framework. In this study we hypothesised that simultaneously including both correction layers would produce more biologically valid and reproducible results than either approach alone, and tested this in two independent RNA-seq datasets of amyotrophic lateral sclerosis (ALS) cases and controls with matching genotype data. Results: Four nested differential expression models (corrected for PC-only, SV-only, both SV and PC, and neither PCs nor SVs) were evaluated across the KCLBB (96 cases and 52 controls) and ALS Consortium (272 cases and 35 controls) datasets. Models were evaluated on: cross-dataset effect size concordance, cross-dataset replicability quantified by the Jaccard Similarity Index, and biological recall against a curated reference set of 66 known ALS genes. The combined SV+PC framework consistently outperformed simpler models across all metrics. Replicability improved nearly ten-fold compared to the non-corrected model (Jaccard index: 2.28% to 19.5%), and the combined framework exhibited a statistically significant 2.1% gain over the SV-only model. Biological recall of known ALS genes doubled compared with SV correction alone. Crucially, effect size stability was preserved, with the combined model expanding the shared transcriptomic signal without sacrificing consistency. These findings remained generally robust to PC number in sensitivity analyses. Conclusions: This study found that SVs and genotype PCs address non-redundant sources of confounding, and we recommend their combined use as standard practice in differential expression analysis where matched genotype data are available. Notably, PCs capturing population structure can also be derived directly from RNA-seq data, extending the applicability of this framework to studies lacking matched genotype data. Although this analysis was restricted to ALS datasets, we expect these findings to generalise to other traits.
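
For readers unfamiliar with the replicability metric named in this abstract, the Jaccard index between two sets of significant genes is the size of their intersection divided by the size of their union. The following is a minimal sketch only: the function name and the gene lists (including the placeholder IDs `GENE_X` and `GENE_Y`) are illustrative assumptions, not taken from the preprint.

```python
def jaccard_index(genes_a, genes_b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| for two collections of gene IDs."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


# Hypothetical significant-gene lists from two datasets (illustrative only).
dataset_1 = ["SOD1", "TARDBP", "FUS", "GENE_X"]
dataset_2 = ["SOD1", "FUS", "GENE_Y"]
overlap = jaccard_index(dataset_1, dataset_2)
print(f"Jaccard index: {overlap:.2f}")
```

Because the denominator is the union, the index penalises genes found in only one dataset, which is why it can stay low even when many genes are shared.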

7
Daily feeding rhythms may play a role in the genetic variability of feed efficiency in growing pigs

Gilbert, H.; Foury, A.; Agboola, L.; Devailly, G.; Gondret, F.; Moisan, M.-P.

2026-04-21 zoology 10.64898/2026.04.17.719142 medRxiv
Top 0.6%
7.0%

Improving feed efficiency in pigs is essential for reducing production costs and environmental impacts. This study examines the influence of circadian feeding rhythms and genetic polymorphisms on feed efficiency variability using two pig lines divergently selected for Residual Feed Intake (RFI) over ten generations. Feeding behavior was monitored using automatic concentrate dispensers, recording 6,494,097 visits from 3,824 pigs to analyze meal frequency, duration, and diurnal patterns. LRFI pigs ate less frequently, with larger meals and longer durations. They exhibited two distinct feeding peaks, one around 8:00 AM and a higher one at 5:00 PM, and they consumed more feed during the diurnal period and less at night. HRFI pigs showed a smoother, less rhythmic feeding behavior with increased nocturnal intake. The differences between the two RFI lines became more pronounced as the number of generations of selection increased, suggesting a genetic basis. Feeding behaviors, including intake during the two main diurnal peaks, were found to be heritable (heritability estimates: 0.30-0.40), and genetic correlations were observed between feed intake and RFI, especially for intake between the two peaks. We then investigated the evolution of allele frequencies of single nucleotide polymorphisms (SNPs) in DNA sequences surrounding 10 core clock genes (ARNTL, CLOCK, CRY1, CRY2, NPAS2, NR1D1, PER1, PER2, PER3, RORA) along generations of selection. SNPs with significant frequency changes were mapped to regulatory regions and transposable elements, especially in the HRFI line, suggesting potential functional impacts on circadian regulation. These results underscore the role of feeding behavior and genetic variation in feed efficiency, offering insights for breeding programs aimed at improving metabolic efficiency and sustainability in pig production.

8
Attentive-SPIDNA: Attention-based neural networks for population genetics

Sanchez, T.; Jobic, P.; Regan, C.; Verdu, P.; Charpiat, G.; Jay, F.

2026-04-18 evolutionary biology 10.64898/2026.04.15.718687 medRxiv
Top 0.7%
6.7%

Artificial neural networks (ANNs) have recently offered new perspectives to solve inference problems from high dimensional data in numerous scientific fields, but it is still unclear which architectures are best suited to genomic data. Here, we present a new ANN architecture integrating attention mechanisms to infer effective population size history from genomic data. Built upon our previous exchangeable architecture SPIDNA, Attentive-SPIDNA adds attention layers that allow computing more expressive and complex features from combinations of haplotypes. The contribution of each haplotype to the features is learned automatically and depends on its content and affinity with the other haplotypes. Likewise, we use this mechanism to automatically perform a voting scheme that aggregates predictions from different genomic regions. This new architecture outperforms approximate Bayesian computation and previously published neural networks while relying directly on raw genetic data and being invariant to haplotype permutation in the input. As a proof-of-concept, we use this architecture to infer the effective population size history of 54 populations from the HGDP dataset (Bergstrom et al, 2020). This application highlights the ability of the network to handle data with a varying number of haplotypes and to quickly perform predictions for datasets including numerous populations. Therefore, the proposed mechanism could be integrated into various neural networks solving population genetics tasks.

9
Identification of a microRNA with a mutation in the loop structure in the silkworm Bombyx mori

Harada, M.; Tabara, M.; Kuriyama, K.; Ito, K.; Bono, H.; Sakamoto, T.; Nakano, M.; Fukuhara, T.; Toyoda, A.; Fujiyama, A.; Tabunoki, H.

2026-03-27 molecular biology 10.64898/2026.03.24.714027 medRxiv
Top 0.7%
6.5%

MicroRNAs (miRNAs) play essential roles in the posttranscriptional regulation of gene expression in organisms. In the process of synthesizing mature miRNAs from miRNA precursors, the miRNA precursors are cleaved via Dicer at their loop structure, after which the miRNA precursors become mature and regulate transcription. However, the consequences of altering the loop sequence are not fully understood. The silkworm Bombyx mori is a lepidopteran insect with many genetic strains. We identified a mutant of the miRNA miR-3260 in which part of the loop structure was lacking in a silkworm strain with translucent larval skin. Here, we aimed to analyze the role of wild-type miR-3260 and the influence of the mutation of the loop structure in B. mori. First, we identified the genomic region responsible for the translucent larval skin phenotype and determined the mutated miR-3260 nucleotide sequence. Then, we predicted the binding partners of wild-type miR-3260 using the RNAhybrid tool and found two juvenile hormone (JH)-related genes as targets of wild-type miR-3260. Next, we assessed the relationships between miR-3260 and JH and found that miR-3260 was highly expressed in the corpora allata and that its expression responded to JH treatment. Meanwhile, miR-3260 mimic and inhibitor did not induce the typical phenotypes associated with JH in B. mori. Then, we compared the dicing products from wild-type and mutant miR-3260 precursors and observed that neither form underwent Dicer-mediated cleavage when the loop structure was altered. These results suggest that loop mutations in the miR-3260 precursor may not influence dicing activity, consistent with the lack of observable phenotypic effects.

10
EGP1K: Whole-Genome Sequencing of 1,024 Egyptians Characterizes Population Structure and Genetic Diversity

Amer, K.; Moustafa, A.; Hassan, W. A.; Adel, E.; AbdElaal, K. R.; Ghanim, T. A.; Abd El-Raouf, A.; El-Hosseiny, A.; El-Sayed, A. F.; Badr, A. H.; Hassan, A.; Kotb, A.; Ragheb, A.; Muhammad, A. M.; Ali, A.; Abdelaal, A.; Ramadan, E.; El-Garhy, F. M.; El Shehaby, H.; Ali, M. A.; Albarbary, M.; Zahra, M. A.; Amer, M.; Elmonem, M. A.; Fahmy, N. T.; Abdel-Haseeb, O. M.; Hassan, T. M.; Daoud, Y. A.; Howeedy, Y.; Farouk, Y. K.; Soror, S.; El-Feky, G.; Sakr, M.; Soliman, N. A.; Gad, Y. Z.; Abdel-Ghaffar, K. A.; Egypt Genome Consortium,

2026-04-06 genomics 10.64898/2026.04.02.715521 medRxiv
Top 0.8%
6.5%

Middle Eastern and North African populations remain underrepresented in genomic databases, comprising less than 1% of genome-wide association study participants despite representing approximately 6% of the global population. Here we present the Egypt Genome Project (EGP1K), in which we performed whole-genome sequencing on 1,024 unrelated Egyptian individuals originating from 21 of Egypt's 27 governorates, recruited through eight clinical and research centers across Upper and Lower Egypt. We identified over 51.3 million variants, of which 17.1 million (33.4%) were absent from dbSNP. Allele frequency comparisons across 6.5 million shared variants showed the strongest concordance with Middle Eastern populations (τ = 0.977). Principal component analysis and ADMIXTURE modeling at K = 7 revealed that Egyptians share a dominant ancestry component (71.8%) with Middle Eastern populations and carry a smaller Egyptian-enriched component (18.5%) that distinguishes them from neighboring groups. Runs of homozygosity varied substantially across subregions, with Upper Egypt showing the highest burden, paralleling elevated consanguinity rates. Carrier frequency analysis identified MEFV (Familial Mediterranean Fever) at 9.1% as the most prevalent pathogenic carrier state; when adjusted for the national consanguinity rate, MEFV carrier status alone projects approximately 6,600 affected births per year. HLA class I typing identified allele frequencies placing Egyptians within the Levantine-Eastern Mediterranean cluster, providing baseline immunogenetic data currently absent from international databases. Analysis of polygenic risk score distributions revealed substantial differences in threshold-based risk stratification between Egyptians and European reference populations. When the European-derived 90th percentile threshold was applied, 83.3% of Egyptians were assigned to high-risk strata for stroke, 76.4% for chronic kidney disease, and 72.8% for gout, compared to the intended 10% high-risk proportion. These distributional shifts were observed across several cardiometabolic traits (Cohen's d = 1.55-1.61), while other traits showed closer cross-population concordance, indicating that the degree of threshold miscalibration varies by trait. Together, these findings establish EGP1K as a genomic reference for Egypt and indicate that European-derived risk stratification thresholds may not be directly transferable to the Egyptian population, supporting the need for population-specific calibration of polygenic risk scores.
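
The threshold miscalibration described in this abstract can be illustrated with a small calculation: if scores in a target population are shifted relative to the reference population, a cutoff set at the reference 90th percentile no longer flags 10% of people. The sketch below assumes, purely for illustration, that both score distributions are normal with equal variance and differ only by a mean shift in reference standard deviations. The function name is hypothetical, and the output is not expected to reproduce the percentages reported in the preprint, which come from the observed distributions.

```python
from statistics import NormalDist


def fraction_above_reference_cutoff(mean_shift_sd, percentile=0.90):
    """Share of a shifted population above a reference-percentile cutoff.

    Illustrative assumption: both populations are normal with unit variance,
    and the target mean is shifted by `mean_shift_sd` reference SDs.
    """
    reference = NormalDist(mu=0.0, sigma=1.0)
    cutoff = reference.inv_cdf(percentile)  # cutoff on the reference scale
    target = NormalDist(mu=mean_shift_sd, sigma=1.0)
    return 1.0 - target.cdf(cutoff)


for shift in (0.0, 0.5, 1.55):
    share = fraction_above_reference_cutoff(shift)
    print(f"mean shift {shift:.2f} SD -> {share:.1%} above the reference cutoff")
```

With no shift the function returns the intended 10%; larger shifts push a growing share of the target population over the same cutoff, which is the qualitative effect the authors describe.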

11
Evaluating the reliability of tools for mRNA annotation and IRES studies

May, G. E.; Akirtava, C.; McManus, J.

2026-03-31 genomics 10.64898/2026.03.29.707813 medRxiv
Top 0.8%
6.4%

Since the discovery of viral Internal Ribosome Entry Sites (IRESes), researchers have sought to find similar elements in mammalian host genes, termed "cellular IRESes". However, the plasmid systems used to measure cellular IRES activity are vulnerable to false positives due to promoter activity in candidate IRESes. Orthogonal methods are needed to validate putative IRESes while carefully avoiding artifacts known to cause false positives. Recently, Koch et al. proposed approaches for studying IRESes, primarily circular RNA-generating plasmids, and for validating mRNA transcripts using smFISH and qRT-PCR. Here, we demonstrate confounding variables and artifacts in each of these approaches that can lead to inappropriate conclusions about potential cellular IRES activity. We show the back-splicing circRNA plasmid creates linear mRNA artifacts associated with false-positive IRES signals. Using orthogonal, gold-standard assays validated with viral IRESes, we find putative cellular IRESes reported using the back-splicing plasmid have no IRES activity. Furthermore, we demonstrate that smFISH and qRT-PCR can misidentify nuclear non-coding RNAs as mRNAs, and we validate a single molecule sequencing assay for identifying genuine mRNA 5′ ends. Our work establishes reliable methods for robust transcript annotation and IRES studies that avoid documented artifacts arising from bicistronic and back-splicing circRNA plasmid reporters.

12
CCIDeconv: Hierarchical model for deconvolution of subcellular cell-cell interactions in single-cell data

Jayakumar, R.; Panwar, P.; Yang, J. Y. H.; Ghazanfar, S.

2026-03-30 bioinformatics 10.64898/2026.03.26.714643 medRxiv
Top 0.9%
6.2%

Motivation: Cell-cell interaction (CCI) underlies several fundamental mechanisms including development, homeostasis and disease progression. CCI are known to be localised to specific subcellular regions, for example, within the cytoplasms of cells. With the emergence of subcellular spatial transcriptomics technologies (sST), there is an opportunity to attribute CCI to subcellular regions. We aimed to deconvolute CCI to subcellular CCI (sCCI) in non-spatial single-cell transcriptomics (i.e. scRNA-seq) datasets using a modified CCI score from CellChat. Results: By calculating the sCCI score specific to cytoplasm and nucleus in nine publicly available sST datasets, we identified unique nucleus-nucleus and cytoplasm-cytoplasm sCCI. Then, we deconvolved the communication score to subcellular regions by using a hierarchical classification and regression model, which we name CCIDeconv. We performed leave-one-dataset-out cross-validation across nine datasets over a range of different tissue types from human samples. We observed that training across many different tissue types resulted in robust deconvolution performance in an unseen dataset. As the number of training datasets increased, models trained without spatial features achieved similar performance to models including spatial features. This implied the potential for accurate prediction of sCCI events even from scRNA-seq, given large numbers of training datasets. Overall, we offer a method towards attributing CCI events to subcellular regions. This method can help researchers dissect sCCI patterns to gain insights into the underlying biology in a range of tissues covering health and disease.

13
Dissecting oligogenic and polygenic indirect genetic effects through the lens of neighbor genotypic identity

Sato, Y.; Hamazaki, K.

2026-04-03 genetics 10.64898/2026.03.31.715746 medRxiv
Top 1.0%
6.2%

Individual phenotypes often depend on the genotypes of other individuals within a group. These phenomena are termed indirect genetic effects (IGEs) and have been distinguished from direct genetic effects (DGEs) using quantitative genetic models. Recent studies have utilized high-resolution polymorphism data to enable genomic prediction (GP) and genome-wide association study (GWAS) of IGEs, but unified methods remain limited. Here we integrate polygenic and oligogenic IGEs using a multi-kernel mixed model incorporating two random effects with a single covariance parameter. Underlying this implementation, the Ising model of ferromagnetism enabled us to simplify locus-wise and background IGEs for GWAS and GP, respectively. Our simulations demonstrated that, while the previous and present models exhibited similar performance, the present model can infer a trade-off between DGEs and IGEs. By applying this method to three species of woody plants, we found evidence for intergenotypic competition in aspen and apple trees, but limited evidence in climbing grapevines. Based on GWAS, we also detected significant variants associated with competitive IGEs on apple trunk growth. Our study offers a flexible implementation for GWAS/GP of IGEs, thereby providing an effective tool to dissect the genetic architecture of group performance.

14
Bone2Gene: Next-generation Phenotyping of Rare Bone Diseases

Bolmer, E.; Schmidt, P.; Fischer, I.; Rassmann, S.; Ruder, A.; Hustinx, A.; Kirchhoff, A.; Beger, C.; Skaf, K.; Fardipour, M.; Hsieh, T.-C.; Keller, A.; De Rosa, A.; Kalantari, S.; Sirchia, F.; Kotnik, P.; Born, M.; Solomon, B. D.; Waikel, R. L.; Tkemaladze, T.; Abashishvili, L.; Melikidze, E.; Sukhiashvili, A.; Lartsuliani, M.; Nevado, J.; Tenorio, J.; Juergens, J.; Lindschau, M.; Lampe, C.; Moosa, S.; Pantel, J. T.; Mattern, L.; Elbracht, M.; Luk, H.-M.; Travessa, A.; De Victor, J.; Alhashim, M.; Alhashem, A.; AlKaabi, N.; Kocagil, S.; Akbas, E.; Kornak, U.; Rohrer, T.; Pfaeffle, R.; Soucek,

2026-03-27 genetic and genomic medicine 10.64898/2026.03.25.26349289 medRxiv
Top 1%
4.9%

Background: Diagnosing the over 700 known rare bone diseases (RBDs) is inherently challenging and often requires extensive time and multiple clinical visits. Effective treatment, particularly for RBDs with approved therapies, depends on early and precise identification of the specific RBD type. Image recognition artificial intelligence (AI) has the potential to significantly enhance diagnostic processes and improve patient outcomes. Many of these disorders cause characteristic skeletal changes, especially in the hands, and are associated with growth abnormalities. Consequently, affected children routinely undergo hand radiographs for bone age assessment, making these images a widely available yet underutilized diagnostic resource. Materials and Methods: We retrospectively compiled 5,623 multi-institutional hand radiographs from 2,471 patients with 45 different RBDs and 1,382 unaffected controls. We trained two deep learning models: a binary classifier to differentiate between RBD and non-RBD hand radiographs, and a multi-class classifier covering ten RBDs (or RBD groups), using 5-fold cross-validation. Preprocessing included masking, normalization, and data augmentation. Additionally, we applied occlusion sensitivity mapping to visualize class-specific features and evaluated the learned representations through cosine-based retrieval and UMAP projections of the feature space. Results: The affected versus unaffected classifier achieved a balanced accuracy of 85.5% on the test dataset. The ten-class classifier reached a balanced (top-1) accuracy of 76.6%, with top-3 accuracy exceeding 90%. Disorders with highly distinctive phenotypes, such as achondroplasia, achieved accuracies above 95%, whereas phenotypically overlapping disorders, such as ACAN- and SHOX-related short stature, were more frequently confused. Feature space analysis showed that validation samples clustered closely with their respective training distributions, supporting the consistency and generalizability of the learned embeddings. Conclusion: This manuscript presents a proof of principle for the development of Bone2Gene, a next-generation phenotyping (NGP) tool for the detection and differential diagnosis of RBDs, currently based on hand radiographs. Ongoing efforts focus on expanding the dataset to include additional RBDs or RBD groups in the current multi-class classifier for differential diagnosis and to further evaluate its generalizability. The Bone2Gene study is open to collaboration.
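
Balanced accuracy, the headline metric in this abstract, is commonly defined as the mean of per-class recall, so that rare classes count as much as common ones. The sketch below shows that definition on made-up labels; the function name and toy data are assumptions for illustration, and the preprint's own evaluation code is not shown in the abstract.

```python
from collections import defaultdict


def balanced_accuracy(true_labels, predicted_labels):
    """Mean of per-class recall, so each class counts equally regardless of size."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(true_labels, predicted_labels):
        totals[truth] += 1
        if truth == pred:
            correct[truth] += 1
    if not totals:
        raise ValueError("no labels supplied")
    recalls = [correct[label] / totals[label] for label in totals]
    return sum(recalls) / len(recalls)


# Toy, made-up labels: 8 controls and 2 affected cases.
y_true = ["control"] * 8 + ["affected"] * 2
y_pred = ["control"] * 8 + ["control", "affected"]
score = balanced_accuracy(y_true, y_pred)
print(f"balanced accuracy: {score:.2f}")
```

In this toy case plain accuracy is 9 out of 10, but one of the two affected cases is missed, so the per-class average is noticeably lower. That gap is the reason the metric suits imbalanced diagnostic datasets.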

15
Proteomic Insights into Lp(a) Cardiovascular Mechanisms: A Mendelian Randomization Study

Tomasi, J.; Xu, H.; Zhang, L.; Carey, C. E.; Schoenberger, M.; Yates, D. P.; Casas, J.

2026-04-22 genetic and genomic medicine 10.64898/2026.04.20.26351299 medRxiv
Top 1%
4.4%

Background: Elevated lipoprotein(a) [Lp(a)] is a known risk factor for several cardiovascular-related diseases, as established by multiple genetic and observational studies. However, the underlying mechanisms mediating the effects of Lp(a) levels on cardiovascular disease risk and major adverse cardiovascular events (MACE) are unclear. The aim of this study was to identify proteins downstream of Lp(a) using Mendelian randomization (MR), a genetic causal inference approach. Methods: A two-sample MR was performed by initially identifying Lp(a) genetic instruments based on data from genome-wide association studies (GWAS) of Lp(a) blood concentrations. These instruments were then tested for association with proteins from proteomic pQTL data (Olink from UK Biobank, 2940 proteins and SomaScan from deCODE, 4907 proteins). Results: A total of 521 proteins associated with Lp(a) were identified. Using pathway enrichment analysis, the following MACE-relevant pathways were identified comprising a total of 91 Lp(a) downstream proteins: oxidized phospholipid-related, chemotaxis of immune cells and endothelial cell activation, pro-inflammatory monocyte activation, neutrophil activity, coagulation, and lipid metabolism. Conclusion: The results suggest that the influence of Lp(a) treatments is primarily through modifying inflammation rather than lipid-lowering, thus providing insight into the mechanistic framework which mediates the effects of elevated Lp(a) on atherosclerotic cardiovascular disease.
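
The abstract does not state which estimator the two-sample MR used. One common choice is the fixed-effect inverse-variance-weighted (IVW) estimate, which combines per-instrument SNP-exposure and SNP-outcome effects from separate studies. The sketch below shows that estimator as an assumption for illustration; the function name and the toy summary statistics are hypothetical and are not drawn from the preprint.

```python
def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Fixed-effect inverse-variance-weighted MR estimate and its standard error.

    Each list holds one entry per genetic instrument: the SNP-exposure effect,
    the SNP-outcome effect, and the standard error of the SNP-outcome effect.
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("all inputs must have one entry per instrument")
    rows = list(zip(beta_exposure, beta_outcome, se_outcome))
    numerator = sum(bx * by / se**2 for bx, by, se in rows)
    denominator = sum(bx**2 / se**2 for bx, by, se in rows)
    if denominator == 0:
        raise ValueError("instruments carry no exposure signal")
    estimate = numerator / denominator
    standard_error = (1.0 / denominator) ** 0.5
    return estimate, standard_error


# Toy summary statistics (made up): outcome effects are half the exposure effects.
bx = [0.10, 0.20, 0.30]
by = [0.05, 0.10, 0.15]
se = [0.01, 0.02, 0.01]
estimate, standard_error = ivw_estimate(bx, by, se)
print(f"IVW estimate: {estimate:.3f} (SE {standard_error:.3f})")
```

The estimate is a weighted regression of outcome effects on exposure effects through the origin, so instruments with precise outcome estimates and strong exposure effects dominate the result.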

16
Genetic Impacts on Variability of Body Fat Distribution Uncover Gene-Environment and Gene-Gene Interactions

Zhang, X.; Joehanes, R.; Ma, J.; Pain, O.; Levy, D.; Westerman, K.; Bell, J. T.

2026-04-02 genetics 10.64898/2026.03.31.715615 medRxiv
Top 1%
4.3%

Body fat distribution is a strong predictor of cardiometabolic disease risk. Gene-environment and gene-gene interactions can affect body fat distribution, resulting in differential phenotypic variance across genotype groups that can be detected through variance quantitative trait loci (vQTLs). Using UK Biobank MRI data in conjunction with genetic data, we explored evidence for vQTLs for body fat distribution phenotypes, aiming to uncover potential genetic interactions. We identified three vQTLs for liver fat distribution, including rs738408 (PNPLA3), rs429358 (APOE), and rs58542926 (TM6SF2), and one vQTL region (FTO) for abdominal subcutaneous adipose tissue. To dissect putative gene-environment and gene-gene interactions underlying these signals, we identified multiple vQTL-environment interactions and one epistatic effect (rs58542926*rs429358) for liver fat. The vQTLs and interaction results were validated in multiple UK Biobank and external replication cohort datasets (Framingham Heart Study, All of Us, and TwinsUK), showing replication of the three liver vQTLs, with the greatest reproducibility for vQTL rs738408. Our findings uncover vQTLs and underlying interaction effects on body fat distribution, especially liver fat, that may be useful for the development of precision medicine approaches.
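The idea of a vQTL, a variant at which phenotypic variance differs across genotype groups, can be sketched with a Brown-Forsythe (median-centred Levene) statistic. The abstract does not name the test the authors applied, so this choice is an assumption, and the function name and toy phenotype values below are hypothetical.

```python
from statistics import mean, median

def brown_forsythe_statistic(groups):
    """Brown-Forsythe F statistic for equality of variances (illustrative sketch).

    groups: list of lists, one list of phenotype values per genotype group
            (e.g. 0, 1 or 2 copies of the effect allele).
    A large value suggests that phenotypic variance differs by genotype.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least two groups and more samples than groups")
    # Absolute deviation of each value from its own group's median.
    deviations = []
    for g in groups:
        m = median(g)
        deviations.append([abs(y - m) for y in g])
    group_means = [mean(d) for d in deviations]
    grand_mean = sum(sum(d) for d in deviations) / n_total
    # One-way ANOVA on the absolute deviations.
    between = sum(len(d) * (gm - grand_mean) ** 2
                  for d, gm in zip(deviations, group_means))
    within = sum((v - gm) ** 2
                 for d, gm in zip(deviations, group_means) for v in d)
    if within == 0:
        raise ValueError("no within-group variation in the deviations")
    return (n_total - k) / (k - 1) * between / within

# Toy data (hypothetical): the spread grows with each additional allele.
f_stat = brown_forsythe_statistic([[1, 2, 3], [0, 10, 20], [-100, 0, 100]])
```

Turning the statistic into a p-value requires the F distribution with (k - 1, N - k) degrees of freedom, which this standard-library sketch leaves out.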

17
Network-Based Functional Fragility Reveals System-Level Reorganization Of The Gut Microbiome In Inflammatory Bowel Disease

Kenavdekar, M. V.; Natarajan, E.

2026-04-21 bioinformatics 10.64898/2026.04.16.719113 medRxiv
Top 1%
4.2%

The human gut microbiome plays a critical role in host health, yet its functional organization in disease remains poorly understood. Most studies focus on taxonomic composition or pathway abundance, which fail to capture higher-order interactions governing system-level behavior. Here, we investigated microbiome functional organization in inflammatory bowel disease (IBD), including Crohn's disease (CD), ulcerative colitis (UC), and healthy controls (HC), using a network-based framework across 60 metagenomic samples. Functional pathway profiles were used to construct correlation-based interaction networks, followed by analysis of network topology, functional redundancy, keystone pathway architecture, and system robustness. Disease-associated networks (CD and UC) exhibited reduced global connectivity, increased modular fragmentation, and centralization of keystone pathways, indicating a shift from distributed organization to more fragmented and fragile network structures compared to healthy controls. Notably, machine learning models demonstrated that network-derived features achieved higher classification performance (accuracy up to 0.824) than redundancy-based measures. These findings reveal that microbiome dysfunction in IBD is driven by large-scale reorganization of functional interaction networks rather than loss of functional capacity. This study highlights the importance of network-level analysis in understanding microbiome-associated disease and provides a systems-level framework for future research.
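A correlation-based interaction network of the kind mentioned in this abstract can be sketched as follows: pathways are nodes, and two pathways are linked when the absolute Pearson correlation of their abundance profiles across samples passes a cutoff. The cutoff of 0.8, the use of Pearson correlation, the function names, and the toy profiles are all assumptions for illustration; the abstract does not give these details.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def build_network(profiles, threshold=0.8):
    """Return the set of pathway pairs whose |correlation| is at least threshold.

    profiles: dict mapping pathway name -> list of abundances, one per sample.
    """
    edges = set()
    for a, b in combinations(sorted(profiles), 2):
        if abs(pearson(profiles[a], profiles[b])) >= threshold:
            edges.add((a, b))
    return edges

def density(n_nodes, edges):
    """Fraction of possible undirected edges that are present."""
    possible = n_nodes * (n_nodes - 1) / 2
    return len(edges) / possible if possible else 0.0

# Toy profiles (hypothetical): four pathways measured in four samples.
profiles = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [4, 3, 2, 1],
    "D": [1, 3, 2, 2],
}
edges = build_network(profiles)
network_density = density(len(profiles), edges)
```

Comparing a global measure such as density between disease and control networks is one simple way to express the "reduced global connectivity" reported above.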

18
Knob K180 Constitutive Heterochromatin Of Maize Exhibits Tissue-Specific Chromatin Sensitivity Profiles Distinct From Other Types Of Heterochromatin

Sattler, M. C.; Singh, A.; Bass, H. W.; Mondin, M.

2026-04-04 genetics 10.64898/2026.04.01.715864 medRxiv
Top 1%
4.2%

Background: Maize knobs are regions of constitutive heterochromatin that are readily identified in both meiotic and somatic chromosomes. These structures have been characterized as stable throughout the cell cycle, exhibiting late replication during the S-phase, and are composed of two specific families of highly repetitive DNA sequences: K180 and TR-1. Although widely used as cytogenetic markers due to their variability in number and chromosomal position across inbred lines, hybrids, and landraces, little is known about their chromatin structure and dynamics. In this study, we analyzed chromatin accessibility of knobs using DNS-seq data across four maize tissues representing distinct developmental stages. Results: Our results reveal that K180 knobs exhibit tissue-specific variation in chromatin accessibility, transitioning between open and closed states during development. In contrast, the TR-1 knob of chromosome 4 remained consistently inaccessible across all tissues analyzed. A knob composed of both K180 and TR-1 further supported this observation, with only the K180 region showing dynamic accessibility. To validate these findings, we also analyzed other repetitive regions such as centromeres, which showed a uniformly closed chromatin structure similar to TR-1. These results suggest a unique developmental modulation of chromatin accessibility associated with K180 repeats. While the chromatin accessibility of knobs does not reach the levels observed at Transcription Start Sites (TSS), the comparison among different classes of repetitive DNA within maize constitutive heterochromatin provides compelling evidence for sequence-specific and tissue-specific chromatin dynamics. Conclusions: Our findings uncover a previously unrecognized property of maize knobs and establish a reference for future studies on chromatin organization and epigenetic regulation of repetitive DNA in plant genomes.

19
Detecting context-dependent selection on cancer driver genes with DiffDriver

Zhou, J.; Zhang, Q.; Song, L.; He, X.; Zhao, S.

2026-04-09 genomics 10.64898/2026.04.06.716771 medRxiv
Top 1%
4.2%

Positive selection on somatic mutations is the driving force for cancer progression. Growing evidence shows that the emergence of a driver mutation in a tumor sample depends on individual-specific factors, for example environmental exposures or the individual's germline genetic background. We term these individual-level factors the "contexts" of a tumor. Our hypothesis is that mutations in a driver gene can bring different growth advantages in different contexts, resulting in "differential selection" on these genes in varying contexts. Identifying which contexts modulate selection strength provides critical insights into the selection forces driving tumorigenesis. However, due to the sparsity of somatic mutations and heterogeneous background mutational processes across positions and individuals, identification of differential selection has limited power with current statistical tools and is prone to false positives. To address this, we developed a powerful statistical method, DiffDriver, that identifies associations between "contexts" and selection strength on a driver gene across individuals. DiffDriver accounts for variations of mutation rates across bases and individuals, while taking advantage of functional information of sequences to improve the power. Through simulations, we show DiffDriver reduces false positives and boosts power compared to current methods. Our results highlight that multiple individual-level factors create significant heterogeneity in the strength of selection acting on driver genes, and 33% of driver genes showed differential selection in at least one of the contexts studied, including tumor clinical traits and tumor immune microenvironment subtypes. These results provide new insights into the context-dependent forces driving cancer evolution.

20
Joint modeling of social genetic effects in mono- and pluri-specific groups: case study in intercrops

Salomon, J.; Enjalbert, J.; Flutre, T.

2026-03-31 genetics 10.64898/2026.03.27.714849 medRxiv
Top 2%
4.0%

The genetics of interspecific groups remains largely unexplored, despite the central role of social (or indirect) genetic effects in shaping phenotypic expression within communities. Intercropping, i.e. the simultaneous cultivation of multiple crop species in the same field, offers a powerful model to harness these interspecific social effects. Such species mixtures provide well-documented agricultural benefits, yet few breeding frameworks have integrated the genetics of social interactions. Here, we address this gap by extending quantitative genetic theory to interspecific groups, with intercropping as a concrete and applied model case. We propose a quantitative genetic model that jointly analyzes intra- and interspecific interactions within a unifying framework. Breeding values are decomposed into a direct component, shared between mono- and mixed crops; an interspecific social component, corresponding to the effect of one species on another; and an intraspecific component that captures the social effects within a mono-genotypic stand of cloned plants. Statistically, this consists of simultaneously fitting several linear mixed models, one per stand type, all having direct breeding values in common. As no open-source software can fit such a complex mixed model, we provide an implementation in R/C++. Simulations across various genetic (co)variance structures and sparse experimental designs showed accurate estimation of all genetic (co)variances and breeding values. With an incomplete, yet balanced design combining sole crops and intercrops, genetic gains in both systems were achievable simultaneously, enabling breeding strategies that progressively integrate intercropping into existing, sole-crop-only schemes. More broadly, this framework allows dissecting direct and social genetic effects when genotypes are observed in mono- and mixed-species situations, cultivated or not.